An Efficient and Scalable Approach for Implementing Fault-Tolerant DSM Architectures

Authors

  • Christine Morin
  • Anne-Marie Kermarrec
  • Michel Banâtre
  • Alain Gefflaut
Abstract

Distributed Shared Memory (DSM) architectures are attractive for executing high-performance parallel applications. Because they are made up of a large number of components, however, these architectures have a high probability of failure. We propose a protocol to tolerate node failures in two classes of DSM architectures: Cache Only Memory Architectures (COMA) and Shared Virtual Memory (SVM) systems. The proposed solution is based on backward error recovery and consists of an extension to the existing coherence protocols that manages both the data used by processors for computation and the recovery data used for fault tolerance. The implementation of the protocol in a COMA architecture has been evaluated by simulation. The protocol has also been implemented in an SVM system on a network of workstations. Both simulation results and measurements show that our solution is efficient and scalable.

Une approche efficace et extensible pour mettre en œuvre des architectures à mémoire partagée répartie tolérantes aux fautes

Résumé: Distributed shared memory architectures are attractive for running high-performance parallel applications. Made up of a large number of components, these architectures nevertheless have a very high probability of failure. We propose a protocol based on backward recovery to tolerate node failures in two classes of distributed shared memory architectures: Cache Only Memory Architectures (COMA) and shared virtual memory systems (MVP, mémoire virtuelle partagée). Its design exploits the characteristics of the architectures considered, which makes it possible to store the recovery data (the data restored on rollback) in main memory, ensuring fast establishment and restoration of recovery points while avoiding the development of costly dedicated hardware. The proposed solution is founded on a backward recovery protocol and consists of an extension of the coherence protocol to manage both the data used by the processors and the recovery data. The impact of using the protocol in a COMA architecture has been evaluated by simulation. The protocol has also been implemented within a shared virtual memory on a network of …
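The idea of keeping both current data and recovery data in node memory can be illustrated with a small toy model. The sketch below is not the authors' protocol: the class names (`Node`, `ToyDSM`), the single-owner write rule, and the choice of the "next" node as backup holder are all assumptions made for illustration. It only shows the general shape of backward error recovery, where each item in a recovery point is held in the memory of two distinct nodes so that one node failure leaves a surviving copy.

```python
class Node:
    """One node's memory: current data plus recovery data (hypothetical model)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.current = {}   # addr -> value used by the computation
        self.recovery = {}  # addr -> value belonging to the last recovery point


class ToyDSM:
    """Toy shared memory with in-memory recovery copies (illustrative only)."""

    def __init__(self, n_nodes):
        if n_nodes < 2:
            raise ValueError("need at least two nodes to hold two recovery copies")
        self.nodes = {i: Node(i) for i in range(n_nodes)}
        self.owner = {}  # addr -> id of the node holding the current copy

    def write(self, node_id, addr, value):
        # Assumed coherence rule: the single current copy migrates to the writer.
        old = self.owner.get(addr)
        if old is not None and old != node_id:
            self.nodes[old].current.pop(addr, None)
        self.nodes[node_id].current[addr] = value
        self.owner[addr] = node_id

    def read(self, addr):
        return self.nodes[self.owner[addr]].current[addr]

    def establish_recovery_point(self):
        # Replace the old recovery point, storing each item on two distinct nodes.
        # A real protocol would make this step atomic; this toy does not.
        ids = sorted(self.nodes)
        for node in self.nodes.values():
            node.recovery.clear()
        for addr, owner in self.owner.items():
            value = self.nodes[owner].current[addr]
            backup = ids[(ids.index(owner) + 1) % len(ids)]
            self.nodes[owner].recovery[addr] = value
            self.nodes[backup].recovery[addr] = value

    def fail_and_recover(self, failed_id):
        # Drop the failed node, then roll every survivor back to the recovery point.
        del self.nodes[failed_id]
        restored = {}
        for node in self.nodes.values():
            node.current.clear()
            for addr, value in node.recovery.items():
                restored.setdefault(addr, (node.node_id, value))
        self.owner = {}
        for addr, (nid, value) in restored.items():
            self.nodes[nid].current[addr] = value
            self.owner[addr] = nid
        # Not modelled: re-replicating items that now have only one recovery copy.


dsm = ToyDSM(3)
dsm.write(0, "x", 1)
dsm.write(1, "y", 2)
dsm.establish_recovery_point()
dsm.write(2, "x", 99)      # modification made after the recovery point
dsm.fail_and_recover(2)    # node 2 fails; survivors roll back
print(dsm.read("x"), dsm.read("y"))
```

In this run the write of 99 to `x` is undone by the rollback, and `y` survives even though one of its two recovery copies was on the failed node.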

Similar resources

Reliability and Performance Evaluation of Fault-aware Routing Methods for Network-on-Chip Architectures (RESEARCH NOTE)

Nowadays, faults and failures are increasing especially in complex systems such as Network-on-Chip (NoC) based Systems-on-a-Chip due to the increasing susceptibility and decreasing feature sizes. On the other hand, fault-tolerant routing algorithms have an evident effect on tolerating permanent faults and improving the reliability of a Network-on-Chip based system. This paper presents reliabili...

Novel efficient fault-tolerant full-adder for quantum-dot cellular automata

Quantum-dot cellular automata (QCA) are an emerging technology and a possible alternative for semiconductor transistor based technologies. A novel fault-tolerant QCA full-adder cell is proposed: This component is simple in structure and suitable for designing fault-tolerant QCA circuits. The redundant version of QCA full-adder cell is powerful in terms of implementing robust digital functions. ...

Towards a Robust and Fault-Tolerant Multicast Discovery Architecture for Global Computing Grids

Global grid systems with potentially millions of services require a very effective and efficient service discovery/location mechanism. Current grid environments, due to their smaller size, rely mainly on centralised service directories. Large-scale systems need a decentralised service discovery system that operates reliably in a dynamic and error-prone environment. Work has been done in studyin...

Fault Tolerant Reversible QCA Design using TMR and Fault Detecting by a Comparator Circuit

Quantum-dot Cellular Automata (QCA) is an emerging and promising technology that provides significant improvements over CMOS. Recently QCA has been advocated as an applicant for implementing reversible circuits. However QCA, like other Nanotechnologies, suffers from a high fault rate. The main purpose of this paper is to develop a fault tolerant model of QCA circuits by redundancy in hardware a...

Journal:
  • IEEE Trans. Computers

Volume 49

Publication date 2000